Approximate Bayesian Computation

Elizabeth King
Kevin Middleton

Approximate Bayesian Computation (ABC)

“At its heart, ABC is a very simple method: numerical evaluation of the likelihood function is replaced with an assessment of how likely it is the model could have produced the observed data, based on simulating pseudo-data from the model and comparing it to the observed data.” (Sisson et al. 2018)

ABC

Relationship to MCMC

  • Similar idea: sample parameter values from prior distributions to build posterior distributions
  • ABC works when the likelihood is unavailable or intractable (e.g., many population genetics models)

Basics of ABC

  • Write a model statement (function)
  • Choose an error term (MSE, MAE, etc.) or use the (log) likelihood
  • Draw candidate parameter values from distributions for each parameter
  • Evaluate the parameters given the model
    • Rejection sampling (choose a target number of samples a priori)
    • Collect all samples and evaluate later (sample and filter)
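These steps can be sketched as a minimal, self-contained toy example (hypothetical, not from the slides: estimating the mean of a normal distribution by rejection on the sample mean; all names and values here are illustrative):

```r
set.seed(1)
obs <- rnorm(100, mean = 2, sd = 1)  # "observed" data with true mean 2

n_keep <- 1000
kept <- numeric(n_keep)              # preallocate the output
count <- 0

while (count < n_keep) {
  mu <- runif(1, -5, 5)                       # draw a candidate from the prior
  pseudo <- rnorm(length(obs), mu, 1)         # simulate pseudo-data from the model
  err <- abs(mean(pseudo) - mean(obs))        # compare summaries (error term)
  if (err < 0.1) {                            # keep only close candidates
    count <- count + 1
    kept[count] <- mu
  }
}

median(kept)  # posterior median; should be close to 2
```

The accepted draws approximate the posterior for the mean; the threshold (here 0.1) trades accuracy against run time.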

Simulate data

Gamma distribution

  • Common continuous distribution for biological variables
  • Can resemble a normal distribution in shape
  • Positive, bounded at zero
  • Can have a long tail

library(tidyverse)  # ggplot2, tibble, dplyr, tidyr

set.seed(234798)
y <- rgamma(500, shape = 3, rate = 3)

ggplot(tibble(y), aes(y)) +
  geom_histogram(bins = 30, fill = "darkorange3")

Model statement and function

\[y \sim \mathrm{Gamma}(\mathrm{shape},~\mathrm{rate})\]

gamma_sim <- function(ii, y) {
  # Draw candidate parameters from half-normal priors (normal truncated at zero)
  shape <- truncnorm::rtruncnorm(1, a = 0, mean = 0, sd = 5)
  rate <- truncnorm::rtruncnorm(1, a = 0, mean = 0, sd = 5)
  # Log-likelihood of the observed data given the candidate parameters
  log_lik <- sum(dgamma(y, shape, rate, log = TRUE))
  return(c(ii = ii, shape = shape, rate = rate, log_lik = log_lik))
}

gamma_sim(1, 1:10)
         ii       shape        rate     log_lik 
  1.0000000   2.6818141   0.2662371 -28.9345392 

Choosing the parameter “search” space

  • What are the likely values for the parameters?
    • What distributions?
    • What limits?
  • Similar to choosing Bayesian priors
  • Uniform priors will work fine
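Two common choices for a positive parameter such as a gamma shape, sketched with base R only (the slides use truncnorm::rtruncnorm() for the truncated-normal case; the limits here are illustrative):

```r
set.seed(42)

# 1. Uniform over plausible limits: every value in the range is equally likely
shape_unif <- runif(1e4, min = 0.1, max = 10)

# 2. Half-normal, |N(0, 5)|: favors smaller values but allows occasional
#    large ones (same idea as truncnorm::rtruncnorm(1, a = 0, mean = 0, sd = 5))
shape_half <- abs(rnorm(1e4, mean = 0, sd = 5))

range(shape_unif)  # stays within the chosen limits
range(shape_half)  # bounded below at zero, long upper tail
```

Either works; a uniform prior is simpler to reason about, while the half-normal concentrates sampling effort near smaller values.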

How large will my object be?

Be careful about how many samples you will take:

samples <- matrix(NA, nrow = 1e5, ncol = 4)
colnames(samples) <- c("ii", "shape", "rate", "log_lik")

format(object.size(samples), units = "Mb")
[1] "1.5 Mb"

ABC works best with loops

  • Plain for loops are fast here; the purrr/furrr mapping functions add too much per-iteration overhead

Generating samples

for (ii in seq_len(1e5)) {
  samples[ii, ] <- gamma_sim(ii, y)
}

head(samples)
     ii     shape      rate    log_lik
[1,]  1 11.206811  5.781687 -1596.7511
[2,]  2  6.452868  6.694615  -423.7230
[3,]  3  3.522727  7.328538  -994.0511
[4,]  4  6.905452 10.720679  -852.5172
[5,]  5  5.142266  5.288729  -381.0074
[6,]  6  4.010726  0.539865 -2619.3155

Processing samples

samples <- samples |> 
  as_tibble() |> 
  arrange(desc(log_lik)) |> 
  slice(1:5000)

head(samples)
# A tibble: 6 × 4
     ii shape  rate log_lik
  <dbl> <dbl> <dbl>   <dbl>
1 94107  3.47  3.43   -354.
2 69478  3.46  3.40   -354.
3 13372  3.45  3.41   -354.
4 44377  3.48  3.42   -354.
5 13074  3.41  3.36   -354.
6 57396  3.48  3.41   -354.
samples |> summarise(across(2:3, ~ median(.x)))
# A tibble: 1 × 2
  shape  rate
  <dbl> <dbl>
1  3.50  3.48

Visualizing samples

samples |> 
  select(shape, rate) |> 
  pivot_longer(cols = everything()) |> 
  ggplot(aes(value)) +
  geom_density() +
  facet_grid(name ~ .)

Change the model

gamma_sim_uniform <- function(ii, y) {
  shape <- runif(1, 0.1, 10)
  rate <- runif(1, 0.1, 10)
  log_lik <- sum(dgamma(y, shape, rate, log = TRUE))
  return(c(ii = ii, shape = shape, rate = rate, log_lik = log_lik))
}

Re-initialize the matrix first (the previous `samples` was converted to a 5,000-row tibble, so it can no longer hold 1e5 rows):

samples <- matrix(NA, nrow = 1e5, ncol = 4)
colnames(samples) <- c("ii", "shape", "rate", "log_lik")

for (ii in seq_len(1e5)) {
  samples[ii, ] <- gamma_sim_uniform(ii, y)
}

samples <- samples |> 
  as_tibble() |> 
  arrange(desc(log_lik)) |> 
  slice(1:5000)

samples |> summarise(across(2:3, ~ median(.x)))
# A tibble: 1 × 2
  shape  rate
  <dbl> <dbl>
1  3.89  3.90

Visualizing samples

samples |> 
  select(shape, rate) |> 
  pivot_longer(cols = everything()) |> 
  ggplot(aes(value)) +
  geom_density() +
  facet_grid(name ~ .)

Rejection sampling

  • Choose an error threshold for keeping samples
  • If a sample’s error is below the threshold, keep it
    • Otherwise reject it
  • The output size is known ahead of time
    • Use preallocated matrices to save memory
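The same recipe as a self-contained toy example (hypothetical: recovering the rate of an exponential distribution; the Gompertz slides use this pattern with a truncated-normal prior and tighter thresholds):

```r
set.seed(99)
obs <- rexp(200, rate = 2)   # "observed" data with true rate 2

# Pre-determine the output size and preallocate a matrix
n_keep <- 500
keep <- matrix(NA, nrow = n_keep, ncol = 2)
colnames(keep) <- c("rate", "MAE")
min_MAE <- 0.05              # maximum acceptable error

count <- 0
while (count < n_keep) {
  rate <- runif(1, 0.1, 10)                 # candidate from a uniform prior
  pseudo <- rexp(length(obs), rate)         # simulate pseudo-data
  MAE <- abs(mean(pseudo) - mean(obs))      # error on the sample mean
  if (MAE <= min_MAE) {                     # keep only close candidates
    count <- count + 1
    keep[count, ] <- c(rate, MAE)
  }
}

median(keep[, "rate"])  # should be near the true rate of 2
```

Because the output size is fixed up front, the loop runs until the matrix is full; lowering `min_MAE` sharpens the posterior but multiplies the number of rejected draws.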

Challenging models

  • “Intractable” likelihoods
    • Flat or oddly shaped surfaces
    • Discontinuous likelihood
  • Nonlinear models
  • Multiple optima

Non-linear model: Gompertz growth equation

\[y(t) = a e^{-b e^{-c t}}\]

  • \(a\) is the asymptotic size
  • \(b\) sets the displacement of the curve along the \(t\) axis
  • \(c\) is the growth rate

Gompertz <- function(t, a, b, c) {
  a * exp(-b * exp(-c * t))
}

Testing

GG <- tibble(t = seq(0, 25, length.out = 200),
             y = Gompertz(t, a = 5, b = 1, c = 0.5))

ggplot(GG, aes(t, y)) +
  geom_line(linewidth = 1) +
  scale_y_continuous(limits = c(0, 6))

Simulate data

set.seed(436578)
GGsim <- tibble(t = runif(20, min = 0, max = 25),
                y = Gompertz(t, a = 5, b = 1, c = 0.5) +
                  rnorm(20, 0, 0.25))

ggplot() +
  geom_line(data = GG, aes(t, y), linewidth = 1) +
  geom_point(data = GGsim, aes(t, y), size = 3) +
  scale_y_continuous(limits = c(0, 6))

Set up for rejection sampling

  • Pre-determine the number of samples
  • Decide on an error term
  • Decide on the error threshold (maximum acceptable error)

samples <- matrix(NA, nrow = 1e3, ncol = 5)
colnames(samples) <- c("ii", "a", "b", "c", "MAE")

min_MAE <- 0.5

Sampling: while() loops

library(tictoc)  # for the tic()/toc() timers

total_iterations <- 0
sample_counter <- 1

tic()
while (sample_counter <= nrow(samples)) {
  total_iterations <- total_iterations + 1

  a <- truncnorm::rtruncnorm(1, a = 0, mean = 5, sd = 2)
  b <- rnorm(1, 0, 4)
  c <- rexp(1, rate = 5)
  
  predicted <- Gompertz(t = GGsim$t, a, b, c)
  MAE <- mean(abs(GGsim$y - predicted))
  
  if (MAE <= min_MAE) {
    samples[sample_counter, ] <- c(sample_counter, a, b, c, MAE)
    sample_counter <- sample_counter + 1
  }
}
toc()
0.465 sec elapsed
total_iterations
[1] 44066

Samples

samples[1:10, ]
      ii        a         b          c       MAE
 [1,]  1 5.104372 2.9192575 0.57257715 0.3769681
 [2,]  2 4.860579 0.4039046 0.20875071 0.3054701
 [3,]  3 5.041581 0.1805600 0.07335673 0.3854870
 [4,]  4 5.556490 1.8963527 0.59041063 0.4885992
 [5,]  5 4.787738 4.5050994 0.80856828 0.4753298
 [6,]  6 4.526859 1.5379045 0.79077532 0.4348779
 [7,]  7 5.530129 1.6338703 0.39480792 0.4797085
 [8,]  8 5.140731 1.3861181 0.44430259 0.2226250
 [9,]  9 4.490962 1.2919265 0.77094821 0.4559979
[10,] 10 4.864262 1.9311239 0.56906294 0.3441100
(pars <- samples |> as_tibble() |> summarise(across(2:4, ~ median(.x))))
# A tibble: 1 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1  5.22 0.668 0.243

Visualizing samples

samples |> 
  as_tibble() |> 
  select(a, b, c) |> 
  pivot_longer(cols = everything()) |> 
  ggplot(aes(value)) +
  geom_density() +
  facet_grid(name ~ ., scales = "free")

Visualizing the model

[Plot: Gompertz curve at the median parameter values overlaid on the simulated data]

Timing

  MAE   Total samples   Time (s)
 1.00           11654       0.31
 0.75           20373       0.51
 0.50           53788       1.25
 0.40          105585       2.43
 0.25          642067      14.61
 0.20         2282010      50.31

ABC Pros and Cons

Pros

  • Works for a wide range of problems, including models with intractable likelihoods
  • Handles multimodal distributions without difficulty

Cons

  • Slow and (very) inefficient
  • Choice of prior can have a large effect (as in Bayesian inference)

References

Beaumont, M. A., W. Zhang, and D. J. Balding. 2002. Approximate Bayesian Computation in Population Genetics. Genetics 162:2025–2035.
Csilléry, K., M. G. B. Blum, O. E. Gaggiotti, and O. François. 2010. Approximate Bayesian Computation (ABC) in Practice. Trends Ecol. Evol. 25:410–418.
Pritchard, J. K., M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. 1999. Population Growth of Human Y Chromosomes: A Study of Y Chromosome Microsatellites. Mol. Biol. Evol. 16:1791–1798.
Sisson, S. A., Y. Fan, and M. A. Beaumont. 2018. Handbook of Approximate Bayesian Computation. CRC Press, Taylor & Francis Group.
Tavaré, S., D. J. Balding, R. C. Griffiths, and P. Donnelly. 1997. Inferring Coalescence Times from DNA Sequence Data. Genetics 145:505–518.
Turner, B. M., and T. Van Zandt. 2012. A Tutorial on Approximate Bayesian Computation. J. Math. Psychol. 56:69–85.
Weiss, G., and A. von Haeseler. 1998. Inference of Population History Using a Likelihood Approach. Genetics 149:1539–1546.